{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Final transformation of title metadata\n",
    "\n",
    "This is a final stage of processing on the file formerly called ```workmeta.tsv``` -- now ```titlemeta.tsv.``` It addresses some errors that slipped into the file, especially re: the \"numbers of copies\" associated with particular records. \n",
    "\n",
    "At the same time, we absorb the probabilistic information in recordmeta.tsv and migrate it across to the smaller titlemeta file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "from matplotlib import pyplot as plt\n",
    "from scipy.stats import pearsonr\n",
    "from collections import Counter"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "title = pd.read_csv('../titlemeta.tsv', sep = '\\t', index_col = 'docid',\n",
    "                   low_memory = False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "record = pd.read_csv('../recordmeta.tsv', sep = '\\t', \n",
    "                     index_col = 'docid', low_memory = False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Fixing the \"copies\" calculation\n",
    "\n",
    "So, in the second deduplication notebook, I\n",
    "\n",
    "    a) group volumes that share the same author and title and\n",
    "    b) count the number of instances of those volumes\n",
    "\n",
    "I then used those counts to fill ```allcopiesofwork``` and ```copiesin25yrs.``` The only problem is that multi-volume works (e.g. triple-decker novels) got credit for *all three volumes*; thus if you construct a list of the most-reprinted books, they're almost all three and four-volume novels!\n",
    "\n",
    "\"Ah,\" you'll say, \"your error was that you should have grouped volumes that shared the same author, title and *enumcron/volume number*.\" That way vol 2 of *Ivanhoe* would only get credit for other volume twos of *Ivanhoe.*\n",
    "\n",
    "Initially plausible, except that deduplication is often merging records with different numbers of volumes--e.g. two-, three-, and one-volume editions of Ivanhoe!\n",
    "\n",
    "Fortunately, we saved the groupings created in the second deduplication notebook. This makes a more principled approach possible: go back and and count distinct record ids in each grouping, and create a sum total of copies by adding up instances for each *record*, not each *volume.*\n",
    "\n",
    "#### first get the record IDs"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "copiesperdoc = dict()\n",
    "recordfordoc = dict()\n",
    "\n",
    "for idx in record.index:\n",
    "    instances = int(record.loc[idx, 'instances'])\n",
    "    r = record.loc[idx, 'recordid']\n",
    "    copiesperdoc[idx] = instances\n",
    "    recordfordoc[idx] = r"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### then read in the groupings created by the second dedup notebook"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "groupdict = dict()\n",
    "with open('../dedup/allgroups.tsv', encoding = 'utf-8') as f:\n",
    "    for line in f:\n",
    "        docs = set(line.strip().split('\\t'))\n",
    "        for d in docs:\n",
    "            groupdict[d] = docs"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### finally, add up the fractional instance values associated with titles\n",
    "\n",
    "For each title, find the associated group. Add up all the fractional instance values, or (in the case of ```copiesin25yrs,``` only those found in the 25 years after a title's first appearance in Hathi.)\n",
    "\n",
    "This will end up producing a value that is *roughly* the number of copies of a book's complete text found in Hathi. Note that since vols of a set are not always present in the same number of instances, it's quite easy to get fractional values here."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "allcop = []\n",
    "cop25 = []\n",
    "\n",
    "for doc, row in title.iterrows():\n",
    "    \n",
    "    group = groupdict[doc]\n",
    "    allcopies = Counter()\n",
    "    copies25 = Counter()\n",
    "    thisdate = row.inferreddate\n",
    "    \n",
    "    for d in group:\n",
    "        if ':/' in d:\n",
    "            d = d.replace(':', '+')\n",
    "            d = d.replace('/', '=')\n",
    "        ddate = record.loc[d, 'inferreddate']\n",
    "        r = recordfordoc[d]\n",
    "        instances4d = copiesperdoc[d]\n",
    "        if r not in allcopies:\n",
    "            allcopies[r] = [instances4d]\n",
    "        else:\n",
    "            allcopies[r].append(instances4d)\n",
    "            \n",
    "        if thisdate + 25 >= ddate:\n",
    "            if r not in copies25:\n",
    "                copies25[r] = [instances4d]\n",
    "            else:\n",
    "                copies25[r].append(instances4d)\n",
    "        \n",
    "    ac = 0\n",
    "    c25 = 0\n",
    "    for k, v in allcopies.items():\n",
    "        ac = ac + np.mean(v)\n",
    "    for k, v in copies25.items():\n",
    "        c25 = c25 + np.mean(v)\n",
    "        \n",
    "    allcop.append(ac)\n",
    "    cop25.append(c25)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### some EDA to confirm I did that roughly right"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(0.96771458781090858, 0.0)"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pearsonr(allcop, title.allcopiesofwork)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(0.93577806952108133, 0.0)"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pearsonr(cop25, title.copiesin25yrs)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(0.68784454622681268, 0.0)"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pearsonr(allcop, title.copiesin25yrs)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "title = title.assign(allcopiesofwork = allcop)\n",
    "title = title.assign(copiesin25yrs = cop25)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Index(['oldauthor', 'author', 'authordate', 'inferreddate', 'latestcomp',\n",
       "       'datetype', 'startdate', 'enddate', 'imprint', 'imprintdate',\n",
       "       'contents', 'genres', 'subjects', 'geographics', 'locnum', 'oclc',\n",
       "       'place', 'recordid', 'instances', 'allcopiesofwork', 'copiesin25yrs',\n",
       "       'enumcron', 'volnum', 'title', 'parttitle', 'earlyedition',\n",
       "       'shorttitle', 'nonficprob', 'juvenileprob'],\n",
       "      dtype='object')"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "title.columns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(138137, 29)"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "title.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": [
    "title.to_csv('../titlemeta.tsv', sep = '\\t', index_label = 'docid')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
